ABSTRACT



A system and method are disclosed for providing a gesture 
recognition system for recognizing gestures made by a 
moving subject within an image and performing an opera- 
tion based on the semantic meaning of the gesture. A subject, 
such as a human being, enters the viewing field of a camera 
connected to a computer and performs a gesture, such as 
flapping of the arms. The gesture is then examined by the 
system one image frame at a time. Positional data is derived 
from the input frames and compared to data representing 
gestures already known to the system. The comparisons are 
done in real-time and the system can be trained to better 
recognize known gestures or to recognize new gestures. A 
frame of the input image containing the subject is obtained 
after a background image model has been created. An input 
frame is used to derive a frame data set that contains 
particular coordinates of the subject at a given moment in 
time. This series of frame data sets is examined to determine 
whether it conveys a gesture that is known to the system. If 
the subject gesture is recognizable to the system, an opera- 
tion based on the semantic meaning of the gesture can be 
performed by a computer. 

22 Claims, 11 Drawing Sheets 
METHOD AND APPARATUS FOR REAL-TIME GESTURE RECOGNITION

CROSS REFERENCE TO RELATED APPLICATION

This is a continuation of Ser. No. 08/951,070, filing date
Oct. 15, 1997, now U.S. Pat. No. 6,072,494. This application
is related to co-pending U.S. Pat. application Ser. No. filed
herewith. Both are incorporated herein by reference for all
purposes.

BACKGROUND OF THE INVENTION

1. Background

The present invention relates generally to methods and
apparatus for computer-implemented real-time gesture
recognition. More particularly, the present invention relates to
capturing a sequence of images of a moving subject
performing a particular movement or gesture, extracting
relevant data points from these images, and comparing the
resulting sequence of data points to patterns of data points
for known gestures to determine if there is a match.

2. Prior Art

An emerging and increasingly important procedure in the
field of computer science is gesture recognition. In order to
make gesture recognition systems commercially useful and
widespread, they must recognize known gestures in real
time and must do so with minimum or reduced use of the
CPU. From a process perspective a gesture is defined as a
time-dependent trajectory following a prescribed pattern
through a feature space, e.g., a bodily movement or
handwriting. Prior art methods for gesture recognition typically
use neural networks or Hidden Markov Models (HMMs),
with HMMs being the most prevalent choice.

A Hidden Markov Model is a model made up of
interconnected nodes or states. Each state contains information
concerning itself and its relation to other states in the model.
More specifically, each state contains (1) the probability of
producing a particular observable output and (2) the
probabilities of going from that state to any other state in the
model. Since only the output is observed, a system based on
HMMs does not know which state it is in at any given time;
it only knows what the probabilities are that a particular
model produces the outputs seen thus far. Knowledge of the
state is hidden from the system or application.

Examples of gesture recognition systems based on Hidden
Markov Models include a tennis stroke recognition system,
an American sign language recognition system, a system for
recognizing lip movements, and systems for recognizing
handwriting. The statistical nature of HMMs can capture
the variance in the way different people perform gestures at
different times. However, the same statistical nature makes
an HMM a "black box." For example, one state in the model
may represent one particular point in a bodily gesture. An
HMM-based application may know many things about this
point, such as the probabilities that the gesturer will change
position or move in other directions. However, the
application will not be able to determine precisely when it has
reached that point. Thus, the application is not able to
determine whether the person has completed 25% or 50% of
a known gesture.

Therefore, it would be desirable to have a real-time
gesture recognition system that removes the "hidden" layer
found in current systems which use Hidden Markov Models
while still capturing the variance in the way different people
perform a gesture at different times. In addition, it would be
desirable to have a system that would allow more control
over the training and recognition of gestures.

SUMMARY OF THE INVENTION

The present invention provides a system for recognizing
gestures made by a subject within a sequence of images and
performing an operation based on the semantic meaning of
the gesture. In a preferred embodiment, a subject, such as a
human being, enters the viewing field of a camera connected
to a computer and performs a gesture. The gesture is then
examined by the system one image frame at a time.
Positional data is derived from the input frame and compared to
previously derived data representing gestures known to the
system. The comparisons are done in real time and the
system can be trained to better recognize known gestures or
to recognize new gestures.

In a preferred embodiment, a computer-implemented
gesture recognition system is described. A background image
model is created by examining frames of an average
background image before the subject that will perform the
gesture enters the image. A frame of the input image
containing the subject, such as a human being, is obtained
after the background image model has been created. The
frame captures the person in the action of performing the
gesture at one moment in time. The input frame is used to
derive a frame data set that contains particular coordinates of
the subject at that given moment. This sequence of frame
data sets taken over a period of time is compared to
sequences of positional data making up one or more
recognizable gestures, i.e., gestures already known to the system.
If the gesture performed by the subject is recognizable to the
system, an operation based on the semantic meaning of the
gesture may be performed by the system.

In another embodiment the gesture recognition procedure
includes a routine setting its confidence level according to
the degree of mismatch between the input gesture data and
the patterns of positional data making up the system's
recognizable gestures. If the confidence passes a threshold,
a match is considered found.

In yet another preferred embodiment the gesture
recognition procedure includes a partial completion query routine
that updates a status report which provides information on
how many of the requirements of the known gestures have
been met by the input gesture. This allows queries of how
much or what percentage of a known gesture is completed
by probing the status report. This is done by determining
how many key points of a recognizable gesture have been
met.

In yet another embodiment the gesture recognition
procedure includes a routine for training the system to
recognize new gestures or to recognize certain gestures performed
by an individual more efficiently. Several samples of the
subject, i.e., individual, performing the new gesture are used
by the system to extract the number of key points, the
dimensions, and other relevant characteristics of the gesture.
A probability distribution for each key point indicating the
likelihood of producing a particular observable output at that
key point is also derived. Once a characteristic data pattern
is obtained for the new gesture, it can be compared to
patterns of previously stored known gestures to produce a
confusion matrix. The confusion matrix describes possible
similarities between the new gesture and known gestures as
well as the likelihood that the system will confuse these
similar gestures.

In yet another embodiment the gesture recognition
procedure visually displays the subject performing the gesture




and any resulting transformations or augmentations to the 
subject on a computer monitor through model-based com- 
positing. Such a compositing method includes shadow 
reduction and hole and gap filling routines for isolating the 
subject being composited. 

In another aspect of the present invention a computer- 
based system for extracting data to be used to recognize 
gestures made by a subject is described. In a preferred
embodiment, an image modeler creates an initial background
model that does not contain the subject. The system includes a frame
capturer for obtaining an image frame and a frame analyzer 
for analyzing the image thereby determining particular coor- 
dinates of the subject at a particular time. Also described is 
a data set creator for creating a frame data set from the 
particular coordinates and a data set analyzer for examining 
the coordinates in the frame data set and comparing them to 
positional data representing a known gesture. 

Advantages of the methods and systems described and
claimed include real-time recognition of gestures made by
subjects within a dynamic background image. Gestures are
recognized and processed immediately in a computer system 
that can also be trained to recognize new gestures or to 
recognize certain known gestures more efficiently. In 
addition, the subject is composited onto a destination image 
without distorting effects from shadows cast by the subject 
or from color uniformity between the subject and the back- 
ground. This provides for a clean, well-defined composited 
subject on a display monitor which can be processed by the 
computer system according to the semantic meaning of the 
recognized or known gesture. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention, together with further advantages thereof, 
may best be understood by reference to the following
description taken in conjunction with the accompanying 
drawings in which: 

FIG. 1 is a schematic illustration of a general purpose 
computer system suitable for implementing the present 
invention. 

FIG. 2 is a diagram of a preferred embodiment of the 
present invention showing a person with arms extended and 
with the image composited onto a computer monitor through 
the use of a camera. 

FIG. 3 shows a series of screen shots showing a human 
figure performing a gesture, an arm flap, and the resulting 
function performed by the system of transforming the 
human figure to an image of a flying bird. 

FIG. 4 shows another series of screen shots showing a 
human figure performing another recognizable gesture, 
jumping, and the system augmenting the human figure once 
the gesture is recognized. 

FIG. 5a is a flowchart showing a process for a preferred 
embodiment for gesture recognition of the present invention. 

FIG. 5b shows data stored in a frame data set as derived 
from a data or image frame containing a subject performing 
a gesture as described in block 502 of FIG. 5a. 

FIG. 6 is a flowchart showing in greater detail block 504 
of FIG. 5a in which the system runs the gesture recognition 
process. 

FIG. 7 is a flowchart showing in greater detail block 600 
of FIG. 6 in which the system processes the frame data to 
determine whether it matches a recognized gesture. 

FIGS. 8A and 8B are flowcharts showing a process for 
training the system to recognize a new gesture. 
DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS 

Reference will now be made in detail to a preferred 
embodiment of the invention. An example of the preferred 
embodiment is illustrated in the accompanying drawings. 
While the invention will be described in conjunction with a 
preferred embodiment, it will be understood that it is not 
intended to limit the invention to one preferred embodiment. 
To the contrary, it is intended to cover alternatives, 
modifications, and equivalents as may be included within 
the spirit and scope of the invention as defined by the 
appended claims. 

The present invention employs various processes involving
data stored in computer systems. These processes are
those requiring physical manipulation of physical quantities.
Usually, though not necessarily, these quantities take the
form of electrical or magnetic signals capable of being
stored, transferred, combined, compared, and otherwise
manipulated. It is sometimes convenient, principally for
reasons of common usage, to refer to these signals as bits,
values, elements, variables, characters, data structures, or
the like. It should be remembered, however, that all of these
and similar terms are to be associated with the appropriate
physical quantities and are merely convenient labels applied
to these quantities. Further, the manipulations performed are
often referred to in terms such as identifying, running,
comparing, or detecting. In any of the operations described
herein that form part of the present invention, these
operations are machine operations. Useful machines for
performing the operations of the present invention include general
purpose digital computers or other similar devices. In all
cases, it should be borne in mind the distinction between the
method of operations in operating a computer and the
method of computation itself. The present invention relates
to method blocks for operating a computer in processing
electrical or other physical signals to generate other desired
physical signals.

The present invention also relates to a computer system
for performing these operations. This computer system may
be specially constructed for the required purposes, or it may
be a general purpose computer selectively activated or
reconfigured by a computer program stored in the computer.
The processes presented herein are not inherently related to
any particular computer or other computing apparatus. In
particular, various general purpose computing machines
may be used with programs written in accordance with the
teachings herein, or it may be more convenient to construct
a more specialized computer apparatus to perform the
required method blocks.

FIG. 1 is a schematic illustration of a general purpose
computer system suitable for implementing the process of
the present invention. The computer system includes a
central processing unit (CPU) 102, which CPU is coupled
bi-directionally with random access memory (RAM) 104
and unidirectionally with read only memory (ROM) 106.
Typically RAM 104 includes programming instructions and
data, including text objects as described herein in addition to
other data and instructions for processes currently operating
on CPU 102. ROM 106 typically includes basic operating
instructions, data and objects used by the computer to
perform its functions. In addition, a mass storage device 108,
such as a hard disk, CD ROM, magneto-optical (floptical)
drive, tape drive or the like, is coupled bi-directionally with
CPU 102. Mass storage device 108 generally includes
additional programming instructions, data and text objects
that typically are not in active use by the CPU, although the
address space may be accessed by the CPU, e.g., for virtual
memory or the like. Each of the above described computers
further includes an input/output source 110 that typically
includes input media such as a keyboard, pointer devices
(e.g., a mouse or stylus) and the like. Each computer can also
include a network connection 112 over which data,
including, e.g., text objects, and instructions can be
transferred. Additional mass storage devices (not shown) may
also be connected to CPU 102 through network connection
112. It will be appreciated by those skilled in the art that the
above described hardware and software elements are of
standard design and construction.

As discussed above, Hidden Markov Models are typically
used in current gesture recognition systems to account for
variance in possible movements in a gesture. The present
invention uses the HMM construct and removes the hidden
nature of the model by allowing the application to determine
which state in the model it is in. The present invention also
forces the application to move in a certain direction by
removing all the connections from a particular state to the
other states except for one. For example, at state one in a
Hidden Markov Model, an application may be able to go to
states two, three, or four. State one would have the
probabilities that from it, the gesture would go to any one of
those states. In a preferred embodiment of the present
invention, the connections to states three and four are
removed, thus forcing the application or system to go to state
two or to stay in state one. It should be noted that the HMM
construct also allows for this case, which is generally known
as the left-to-right HMM. However, in an HMM implementation,
state one will have two probabilities: one indicating the
probability that it will stay in state one and another that it
will go to state two. In the present invention, there are no
transition probabilities. The application will stay in state one
until it meets the criteria, such as reaching a local extremum,
for moving to state two. Also included in a preferred
embodiment of the present invention is a timing constraint
built into the application. This timing constraint applies to
individual states in the model. For example, a state may have
a timing constraint such that the person cannot stay in a
particular pose or position in the gesture for more than a
predetermined length of time. Furthermore, by removing the
hidden layer in the HMM, the system can determine at any
time how much of a particular gesture has been completed
since the system knows what state the gesture is in.

In another preferred embodiment of the gesture
recognition system of the present invention, a training interface is
included which requires a small degree of human
intervention. A person can "teach" the system new gestures for it to
recognize by performing samples of the new gesture in front
of a camera. The user can then enter certain information
about the new gesture allowing the system to create a model
of the new gesture to store in its library.

FIG. 2 is a diagram of a preferred embodiment of the
present invention showing a person with arms extended and
having the image composited on a computer monitor through
the use of a camera. It shows a computer 206 connected to a
camera 200. In other preferred embodiments, the camera can
be located further away from the computer. Camera 200 has
within its range or field of vision a person 202 with her arms
extended, as if in the middle of an arm flap gesture. In a
preferred embodiment, the image of person 202 performing
the gesture is composited onto a destination image 208 which
is displayed on a computer monitor as shown in FIG. 2.
Assuming one of the system's recognizable gestures is arm
flapping, once the system recognizes that the person is
performing this gesture it will perform an operation associated
with that gesture. Examples of this are shown in FIGS. 3 and
4 below. In other preferred embodiments, the person's
image does not need to be composited onto a destination
image or displayed on the computer monitor. The system can
simply recognize the gesture and perform an operation,
without having to composite the image of the person. In a
preferred embodiment, although the person may be located
in a room with background items that are static, such as
furniture, or non-static, such as a television screen or open
window showing moving objects, such items are not
composited onto a destination image; only the human figure is
composited.

FIG. 3 shows a series of screen shots showing a human
figure performing a gesture, in this case an arm flap, and
the resulting function performed by the system, i.e.,
transforming the human figure to other images of a flying bird. In
other preferred embodiments, the human figure can perform
other types of gestures and be transformed to another figure
or be augmented, as shown in FIG. 4 below. At 300 of FIG.
3, the person is initially flapping her arms up and down at a
rate acceptable to the system. This rate can vary in various
embodiments but is generally dependent on factors such as
camera frame speed or CPU clock speed. At shots 302 and
304, the person is moving her arms up and down in full
range and is performing the complete gesture of arm
flapping. Once this is done and the system recognizes the
gesture, the system transforms the person to a bird as shown
at shot 306. Transforming the human figure to a bird is one
example of a function or operation the computer can
perform once it recognizes the arm flapping gesture. More
generally, once the gesture is recognized the computer can
perform any type of function that the computer was programmed to
perform upon recognition, such as changing applications or
turning the computer on or off. Performing the recognized
gesture is essentially the same as pressing a key on the
keyboard or clicking a button on a mouse.

FIG. 4 shows another example of a preferred embodiment
where the human figure performing a recognizable
gesture, in this case jumping up and down, is augmented
with a new hat by the system once it recognizes the gesture.
In this example, the figure or subject is not transformed as
in FIG. 3, but rather is augmented (i.e., a less significant
change to the figure) by having an object, the hat, added to
it. At shot 400 the figure is standing still. At shots 402 and
404 the figure is shown jumping straight up and down at a
rate acceptable to the system as described above. Once this
gesture is recognized by the system, the computer performs
the function of augmenting the figure by placing a hat on the
figure's head as shown at 406. As described above, this
system can perform any type of function that it could
normally perform from a user pressing a key or clicking a
mouse, once it recognizes the gesture. This gesture
recognition and training process is described in greater detail with
respect to FIGS. 5 through 9.

FIG. 5a is a flow diagram showing a process for a
preferred embodiment of object gesture recognition of the
present invention. At 500, the system creates or digitally
builds a background model by capturing several frames of a
background image. The background image is essentially the
setting the system is being used in, for example, a child's
playroom, an office, or a living room. It is the setting in
which the subject, e.g., a person, will enter and, possibly,
perform a gesture. A preferred embodiment of creating a
background model is described in an application titled
"Method and Apparatus for Model-Based Compositing" by
inventor Subutai Ahmad, assigned to Electric Planet, Inc.,
filed on Oct. 15, 1997.
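
The state model described earlier in this section (a left-to-right chain with no transition probabilities, an explicit advance criterion per state, and an optional per-state timing constraint) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `State`, `LeftToRightGesture`, `advance_when`, and `max_dwell` are hypothetical, each observation is reduced to a single number, and resetting to the first state when the timing constraint is violated is an assumed policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class State:
    """One state of a gesture model (hypothetical structure)."""
    name: str
    # Criterion that must be met to leave this state for the next one.
    advance_when: Callable[[float], bool]
    # Timing constraint: maximum seconds the subject may remain here.
    max_dwell: Optional[float] = None


class LeftToRightGesture:
    """Chain of states with no transition probabilities.

    The current state index is always known, so progress through the
    gesture can be queried at any time.
    """

    def __init__(self, states: List[State]) -> None:
        self.states = states
        self.index = 0
        self.entered_at = 0.0

    def reset(self, now: float) -> None:
        self.index = 0
        self.entered_at = now

    def update(self, value: float, now: float) -> bool:
        """Feed one observation; return True when the gesture completes."""
        state = self.states[self.index]
        # Assumed policy: violating the timing constraint restarts the gesture.
        if state.max_dwell is not None and now - self.entered_at > state.max_dwell:
            self.reset(now)
            state = self.states[0]
        if state.advance_when(value):
            self.index += 1
            self.entered_at = now
            if self.index == len(self.states):
                self.reset(now)
                return True
        return False

    def progress(self) -> float:
        """Fraction of the gesture's states already passed."""
        return self.index / len(self.states)
```

Because the model stays in a state until its criterion is met, `progress()` answers the "25% or 50% complete" question that a conventional HMM cannot answer directly.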

Once the background model is created in block 500, in a
preferred embodiment, the system preprocesses an image
frame within which the subject is performing a particular
gesture in block 502. In a preferred embodiment, this
preprocessing involves compositing the object onto a
destination image and displaying the destination image on a
computer monitor, as described with respect to FIG. 2 above.
The compositing process can involve sub-processes for
reducing the effect of shadows and filling holes and gaps in
the object once composited. The destination image can be an
image very different from the background image, such as an
outdoor scene, outer space, or other type of imaginary scene.
This gives the effect of the person performing a gesture, and
being augmented or transformed, in an unusual environment
or setting. A preferred embodiment of the compositing
process is described in detail in copending application titled
"Method and Apparatus for Model-Based Compositing" by
inventor Subutai Ahmad, assigned to Electric Planet, Inc.,
filed on Oct. 15, 1997.

At 504 the system analyzes the person's gesture by
performing a gesture recognition process using as data a
sequence of image frames captured in block 502. A preferred
embodiment of the gesture recognition process is described
in greater detail with respect to FIG. 6. The gesture
recognition process is performed using a gesture database as
shown in block 506. Gesture database 506 contains data
arrays representing gestures known to the system and other
information such as status reports, described in greater detail
below. The gesture recognition process deconstructs and
analyzes the gesture or gestures being made by the person.
At 508 the system determines whether the gesture performed
by the person is actually a recognized or known gesture. The
system has a set of recognizable gestures to which the
gesture being performed by the person is compared. The
data representing the recognizable gestures is stored in data
arrays, described in greater detail with respect to FIG. 6
below. If the gesture performed by the person is a
recognizable gesture, the system proceeds to block 510.

At 510 the system performs a particular function or
operation based on the semantic meaning of the recognized
gesture. As described above this meaning can translate to
transforming the person to another figure, like a bird, or
augmenting the person, for example, by adding a hat. Once
the system recognizes a gesture and performs an operation
based on the gesture, the system returns to block 502 and
continues analyzing image frames of the person performing
further gestures. That is, even though the person has
performed a gesture recognizable to the system and the system
has carried out an operation based on the gesture, the
processing continues as long as the image frames are being
sent to the system. The system will continue processing
movements by the person to see if they match any of its
recognizable gestures. However, if the gesture performed by
the person is not recognized by the system, control also
returns to block 502 where the system captures and
preprocesses the next frame of the person continuing performance
of a gesture (i.e., the person's continuing movements in front
of the camera).

FIG. 5b shows data stored in a frame data set as derived
from an image frame containing the person performing a
gesture as described in block 502 of FIG. 5a. In a preferred
embodiment, the frame data set shown in FIG. 5b contains
x and y coordinate values of certain portions of a person
performing a gesture. For example, these portions can
include: a left extremity, a right extremity, a center of mass,
width, top of head, and center of head. In this example, the
left and right extremities can be the end of a person's right
and left arms and the width can be the person's shoulder
span. In other preferred embodiments, the coordinates can
be of other significant or relevant portions depending on the
subject performing the movements and the type of
movement. The frame data set contains information on the
positions (via x and y coordinates) of significant or meaningful
portions of the subject's "body". What is significant or
meaningful can depend on the nature and range of gestures
expected to be performed by the object or that are
recognized or known to the system. For example, the left and right
extremities of a person are significant because one of the
recognizable gestures is flapping of the arms, which is
determined by the movement of the ends of the person's
arms. In a preferred embodiment, each image or data frame
captured has a corresponding frame data set. The sequence
of frame data sets is analyzed by the gesture recognition
process as shown in block 504 of FIG. 5a and described in
greater detail in FIGS. 6 and 7. As will be described in
greater detail below, information from the frame data set is
extracted in various combinations and can also be scaled as
needed by the system. For example, with an arm flapping
gesture the system would extract width coordinates,
coordinates of right and left extremities, and center of mass
coordinates, and possibly others. Essentially, the frame data
set indicates the location of significant parts of the moving
subject at a given moment in time.

FIG. 6 is a flow diagram showing in greater detail block
504 of FIG. 5a. In step 600 the system processes the frame
data for a known gesture (gesture #1). This process is
repeated for each known gesture contained in the gesture
database shown in FIG. 5 as item 506. Once the frame data
has been compared to gesture data as shown in blocks 600
through 604 (known gesture #N), the system then
determines whether the gesture made by the moving subject
meets any of the completion requirements for the known
gestures in the system in block 606. If the moving subject's
gesture does not meet the requirements for any of the known
gestures, control returns to block 502 of FIG. 5 in which the
system preprocesses a new frame of the moving subject. If
the moving subject's gesture meets the requirements of any
of the known gestures, the system then performs an
operation based on the semantic meaning of the recognized
gesture. For example, if the gesture by the moving object is
recognized to be a flapping gesture, the system can then
transform the human figure on the monitor into a bird or
other objects. The transformation to an image of a bird
would be an example of a semantic meaning of the arm
flapping gesture.

FIG. 7 is a flowchart showing in greater detail block 600
of FIG. 6 in which the system processes the frame data to
determine whether it matches the completion point of a
known gesture. At 700 the system begins processing a frame
data set representative of a captured image frame. An
example of a frame data set is shown in FIG. 5b. As
described above, the frame data set contains coordinates of
various significant positions of the moving subject. The
frame data set contains information on the moving subject at
one particular point in time. As will be described below, the
system continues capturing image frames and, thus, deriving
frame data sets, as long as there is movement by the subject
within view of the camera.

At 702 the system will extract from the frame data set
positional coordinates it needs in order to perform a proper
comparison with each of the gestures known or recognizable
to the system. For example, a known gesture, such as
squatting, may only have two relevant or necessary
coordinates that need to be checked, such as top of head and center
of mass. Other coordinates do not need to be checked in
order to determine whether a person is performing a squat- 
ting movement. Thus, in block 702 the system extracts 
relevant coordinates from the frame data set (in some cases 
it may be all the available coordinates) for comparison to 
known gestures. 
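
The extraction step at block 702 can be illustrated with a small sketch. The field names follow the portions listed for FIG. 5b, but the `FrameDataSet` class, the `REQUIRED` table, and the `extract_for` helper are hypothetical names introduced here for illustration; storing width as a single number is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union

Point = Tuple[float, float]


@dataclass(frozen=True)
class FrameDataSet:
    """Positions of significant portions of the subject in one frame."""
    left_extremity: Point
    right_extremity: Point
    center_of_mass: Point
    top_of_head: Point
    center_of_head: Point
    width: float  # assumed scalar, e.g. shoulder span


# Hypothetical table of the coordinates each known gesture needs.
REQUIRED: Dict[str, Tuple[str, ...]] = {
    "squat": ("top_of_head", "center_of_mass"),
    "arm_flap": ("left_extremity", "right_extremity",
                 "center_of_mass", "width"),
}


def extract_for(frame: FrameDataSet,
                gesture: str) -> Dict[str, Union[Point, float]]:
    """Return only the fields relevant to the named gesture."""
    return {name: getattr(frame, name) for name in REQUIRED[gesture]}
```

Keeping the per-gesture requirements in a table means a gesture such as squatting is compared on two fields only, while the remaining coordinates in the frame data set are ignored for that gesture.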

At 704 the system compares the extracted positional 
coordinates from the frame data set to the positional coor- 
dinates of a particular point of the characteristic pattern of 
each known gesture. Each of the known gestures in the 
system is made up of one or more dimensions. For example, 
the flapping gesture may have four dimensions: normalized 
x and y for the right arm and normalized x' and y' for the left 
arm. A jump may have only two dimensions: one for the 
normalized top of the head and another for the normalized 
center of mass. Each dimension yields a characteristic
pattern of positional coordinates representing the expected
movements of the gesture in a particular space over time.
The extracted positional coordinates from the frame data set
are compared to a particular point along each of these
dimensional patterns for each gesture.
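
One way to picture the comparison in block 704 is sketched below.
It assumes that a point along a dimensional pattern is stored as a
pair of lower and upper limits for each dimension; that storage
format, the dimension names, and the function name
matches_pattern_point are illustrative assumptions, not details
given in the description.

```python
def matches_pattern_point(extracted, pattern_point):
    """Check every dimension of one pattern point against the input.

    extracted:     dimension name -> normalized coordinate value
    pattern_point: dimension name -> (lower, upper) allowable limits
    """
    return all(
        lower <= extracted[dim] <= upper
        for dim, (lower, upper) in pattern_point.items()
    )


# Hypothetical four-dimensional point of an arm flapping pattern
# (arms raised): normalized x and y for each arm.
flap_arms_up = {
    "right_x": (0.8, 1.0),
    "right_y": (0.7, 0.9),
    "left_x": (-1.0, -0.8),
    "left_y": (0.7, 0.9),
}

frame_values = {"right_x": 0.9, "right_y": 0.8, "left_x": -0.9, "left_y": 0.8}
arms_are_up = matches_pattern_point(frame_values, flap_arms_up)
```

A gesture with fewer dimensions, such as a jump, would simply use a
pattern point with fewer entries.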

Each dimensional pattern has a number of key points, also 
referred to as states. A key point can be a characteristic pose 
for a particular gesture. For example, in an arm flapping 
gesture, a key point can be when the arms are at the highest 
or lowest positions. In the case of a jump, a key point may 
be when the object reaches the highest point. Thus, a key 
point can be a point where the object has a significant change 
in direction. Each dimension is typically made up of a few 
key points and flexible zones which are the areas between 
the key points. At 706 the system determines whether a new 
state has been reached. In the course of comparing the 
positional data to the dimensional patterns, the system 
determines whether the input (potential) gesture has reached 
a key point for any of the known gestures. Thus, if a person 
bends her knees to a certain point, the system may interpret 
that as a key point for the jump gesture or possibly a 
squatting or sitting gesture. Another example is a person 
moving her arms up to a certain point and then moving them 
down. The point at which the person begins moving her arms 
down can be interpreted by the system as a key point for the 
arm flap gesture. At 708 the system will make this determi- 
nation. If a new state has been reached for any of the 
gestures, the system updates a status report to reflect this 
event at 710. This informs the system that the person has 
performed at least a part of one known gesture. 
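
The handling of key points in blocks 706 through 710 can be
sketched as a small state tracker. The class name GestureTracker,
the use of allowable ranges for each key point, and the numeric
values are assumptions for illustration only.

```python
class GestureTracker:
    """Tracks how many key points (states) of one gesture were reached."""

    def __init__(self, name, key_points):
        self.name = name
        # Each key point maps a dimension name to (lower, upper) limits.
        self.key_points = key_points
        self.states_reached = 0

    def update(self, coords):
        """Return True if the input reaches the next key point."""
        if self.states_reached == len(self.key_points):
            return False
        target = self.key_points[self.states_reached]
        reached = all(
            lower <= coords[dim] <= upper
            for dim, (lower, upper) in target.items()
        )
        if reached:
            self.states_reached += 1
        return reached


# Hypothetical jump with two key points: knees bent, then highest point.
jump = GestureTracker("jump", [
    {"top_of_head": (0.40, 0.60)},
    {"top_of_head": (0.90, 1.10)},
])

status_report = {}
for coords in [{"top_of_head": 0.75}, {"top_of_head": 0.50}, {"top_of_head": 1.00}]:
    if jump.update(coords):
        # Corresponds to updating the status report at 710.
        status_report[jump.name] = jump.states_reached
```

In this sketch the first frame reaches no key point, while the
second and third frames each advance the jump gesture by one state.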

This information can be used for a partial completion 
query to determine whether a person's movement is likely to 
be a known gesture. For example, a system can inquire or 
automatically be informed when an input gesture has met 
three-quarters or two-thirds of a known gesture. This can be 
determined by probing the status report to see how many 
states of a known gesture have been reached. The system can 
then begin preparing for the completion of the known event. 
Essentially, the system can get a head start in performing the 
operation associated with the known gesture. 
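
A partial completion query of this kind can be sketched as follows.
The status report is assumed here to record the number of states
reached for each gesture; that layout and the function names are
illustrative assumptions.

```python
def partial_completion(status_report, total_states, gesture):
    """Fraction of a gesture's states reached so far (0.0 to 1.0)."""
    return status_report.get(gesture, 0) / total_states[gesture]


def gestures_past(status_report, total_states, fraction):
    """Names of gestures whose completion meets the given fraction."""
    return [
        gesture
        for gesture in total_states
        if partial_completion(status_report, total_states, gesture) >= fraction
    ]


# Hypothetical values: three of four arm flap states reached,
# one of three jump states reached.
total_states = {"arm_flap": 4, "jump": 3}
status_report = {"arm_flap": 3, "jump": 1}

likely = gestures_past(status_report, total_states, 2 / 3)
```

With these values only the arm flap meets the two-thirds level, so
the system could begin preparing the operation associated with it.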

At 712 the system checks whether there is a severe 
mismatch between data from the frame data set and the 
allowable positional coordinates for each dimensional pat- 
tern of each known gesture. A severe mismatch would result, 
for example, from coordinates indicating a change in direc- 
tion that clearly shows that the gesture does not conform to 
a particular known gesture (e.g., an arm going up when the 
system would expect it to go down for a certain gesture). A 
severe mismatch would first be detected at one of a known 
gesture's key points. If there is a severe mismatch the system
resets the data array for the known gesture with which there
was a mismatch at block 714. The system maintains data
arrays for each gesture in which the system stores information
regarding the "history" of the movements performed by
the person and captured by the camera. This information is
no longer needed if it is determined that it is highly unlikely
that the movements by the person will match a particular
known gesture. Once these data arrays are cleared so they
can begin storing new information, the system also resets the
status reports to reflect the mismatch at block 716. By
clearing the status report regarding a particular gesture, the
system will not provide misleading information when a
partial completion query is made regarding that gesture. The
status report will indicate, at the time there is a severe
mismatch, that no part of the particular gesture has been
completed. At 718 the system will continue obtaining and
processing input image frames of the person performing
movements in the range of the camera as shown generally in
FIG. 5a.
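
The severe mismatch handling of blocks 712 through 716 can be
sketched as below. The test used here, a change in direction
opposite to the expected one, follows the example given above; the
function names and the data layout are assumptions for illustration.

```python
def is_severe_mismatch(previous, current, expected_direction):
    """True if the value moved against the expected direction.

    expected_direction is +1 if the value should increase and -1 if
    it should decrease at this point of the gesture.
    """
    return (current - previous) * expected_direction < 0


def reset_gesture(gesture, data_arrays, status_report):
    """Clear the history (block 714) and the status report (block 716)."""
    data_arrays[gesture] = []
    status_report[gesture] = 0


# Hypothetical state: the arm flap expects the arm to go down next,
# but the arm height rises from 0.4 to 0.6.
data_arrays = {"arm_flap": [0.2, 0.4], "jump": [0.5]}
status_report = {"arm_flap": 1, "jump": 1}

if is_severe_mismatch(0.4, 0.6, expected_direction=-1):
    reset_gesture("arm_flap", data_arrays, status_report)
```

Only the gesture with the mismatch is reset; the history and status
of the other known gestures are left untouched.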

Returning to block 708, if a new state has not been
reached for any of the known gestures, the system continues
with block 712 where it checks for any severe mismatches.
If there are no severe mismatches, the system checks
whether there is a match between the coordinates in the
frame data set and any of the known gestures in block 720.
Once again, this is done by comparing the positional coordinates
from the frame data to the coordinates of a particular
point along the characteristic pattern of each dimension of
each of the known gestures. If there is a less-than-severe
mismatch, but a mismatch nonetheless, between the positional
coordinates and a known gesture, the most recent data
in the known gesture's data arrays is kept and older data is
discarded at 722. This is also done if a timing constraint for
a state has been violated. This can occur if a person holds a
position in a gesture for too long. In a preferred embodiment,
the subject's gesture should be continuous. New data stored
in the array is stored from where the most recent data was
kept. The system then continues obtaining new image input
frames as shown in block 718.
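
The trimming performed at 722 can be sketched as follows. How much
recent data is kept is not stated in the description, so the keep
parameter and the function name trim_history are assumptions.

```python
def trim_history(history, keep):
    """Discard older entries in place, keeping the most recent ones."""
    if keep <= 0:
        history.clear()
    else:
        # A negative slice bound drops everything before the last
        # `keep` entries; nothing is removed if the list is shorter.
        del history[:-keep]


# Hypothetical history of one dimension of a known gesture.
history = [0.10, 0.25, 0.40, 0.55, 0.60]
trim_history(history, keep=2)

# New data is stored from where the most recent data was kept.
history.append(0.65)
```

After the call the array holds the two most recent values followed
by the newly stored value.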

If the system determines that the movements performed
by the person match a known gesture, a recognition flag
for that gesture is set at 724. A match is found when the
sequence of positional coordinates from consecutive frame
data sets matches each of the patterns of positional coordinates
of each dimension for a known gesture. Once a match is
found, the system can perform an operation associated with
the known and recognized gesture, such as transforming the
person to another image or augmenting the person, as shown
on a computer monitor. However, the system will also
continue obtaining input image frames as long as the person
is moving within the range of the camera. Thus, control
returns to block 718.

In a preferred embodiment of the present invention, it is
possible for the user to enter new gestures into the system,
thereby adding them to the system library of known or
recognized gestures. One process for doing this is through
training the system to recognize the new gesture. The
training feature can also be used to show the system how a
particular person does one of the already known gestures,
such as the arm flap. For example, a particular person may
not raise her arms as high as someone with longer arms. By
showing the system how a particular person performs a
gesture, the system will be more likely to recognize that
gesture done by that person and recognize it sooner and with
a greater confidence level. This is a useful procedure for
frequent users or for users who pattern one gesture frequently.
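
Why samples from a particular person help can be illustrated with
a rough sketch. The training description states that allowable
upper and lower bounds for the key points are set from means and
variances of the samples. The specific rule used here, the mean
plus or minus k standard deviations, the value of k, and the sample
values are assumptions for illustration only.

```python
from statistics import mean, pstdev


def key_point_bounds(samples, k=2.0):
    """Allowable (lower, upper) bounds for one key point dimension."""
    center = mean(samples)
    spread = pstdev(samples)
    return (center - k * spread, center + k * spread)


# Hypothetical normalized arm heights at the top of an arm flap.
generic_samples = [0.85, 0.90, 0.95]   # users who raise their arms high
personal_samples = [0.50, 0.55, 0.60]  # a person who raises them less

generic_bounds = key_point_bounds(generic_samples)
personal_bounds = key_point_bounds(personal_samples)

# The person's typical arm height of 0.55 is rejected by the generic
# bounds but accepted by the bounds trained on her own samples.
accepted_generic = generic_bounds[0] <= 0.55 <= generic_bounds[1]
accepted_personal = personal_bounds[0] <= 0.55 <= personal_bounds[1]
```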

FIGS. 8A and 8B are flowcharts showing a process for
training the system to recognize a new gesture. At 800 the
system collects samples of the new gesture. One method of
providing samples of the new gesture is for a person to enter
the field of the camera and do the gesture a certain number
of times. This, naturally, requires some user intervention. In
a preferred embodiment, the user or users perform the new
gesture about 30 times. The number of users and the number
of samples have a direct bearing on the accuracy of the
model representing the new gesture and the accuracy of the
statistics of each key point (discussed in greater detail
below). The more representative samples provided to the
system, the more robust the recognition process will be.

At 802 the number of key points in the gesture is entered
as well as the complete time it takes to finish one full
gesture, from start to finish. Essentially, in blocks 800 and
802, the system is provided with a sequence of key points
and flexible zones. The number of key points will vary
depending on the complexity of the new gesture. The key
points determine what coordinates from the input frame data
set should be extracted. For example, if the new gesture is
a squatting movement, the motion of the hands or arms is
irrelevant. At 804 the system determines what dimensions to
use to measure the frame data set. For example, a squatting
gesture may have two dimensions whereas a more complex
gesture may have four or five dimensions. In block 806 the
system determines the location of the key points in a model
representing the new gesture based on the starting and
ending times provided by the user. The system does this by
finding the most prominent peaks and valleys for each
dimension, and then aligning these extrema across all the
dimensions of the new gesture.

At 808 the system calculates a probability distribution of
each state or key point in the model. The system has a set of
routines for calculating the statistics at the key points given
the set of sample gestures. The statistics of interest include
the mean and variance values for each dimension of the
gesture and statistics regarding the timing with respect to the
start of the gesture. Using these means and variances, the
system sets the allowable upper and lower bounds for the
key points, which are used during the recognition phase to
accept or reject the incoming input frame data sets as a
possible gesture match. The system will examine the
samples and derive a probability for each key point. For
example, if an incoming gesture reaches the third state of a
four-state gesture, the probability that the incoming gesture
will match the newly entered gesture may be 90%. On the
other hand, if an incoming gesture meets the newly entered
gesture's first state, there may only be a 10% probability that
the incoming gesture will match the newly entered gesture.
This is done for each key point in each dimension for the
newly entered gesture.

At 810 the system refines the model representing the new
gesture by trying out different threshold values based on a
Gaussian distribution. At this stage a first version of the
model has already been created. The system then runs the
same data from the initial samples and some extraneous data
that clearly falls outside the model through the model. The
system then determines how much of the first set of data can
be recognized by the initial model. The thresholds of each
state are initially set narrowly and are expanded until the
model can recognize all the initial samples but not any of the
extraneous data entered that should not fall within the
model. The purpose of this is to ensure that the refined model
is sufficiently broad to recognize all the samples of the
gesture but not so broad as to accept arbitrary gestures (as
represented by the extraneous data). Essentially, the system
is determining what is an acceptable gesture and what is not.

At 812 the system checks if there are any more new
gestures to be entered into the system by examining frames
of the subject's movements. If the system does not detect
any additional movements by the subject, it proceeds to
block 814.

At 814 the system updates a gesture confusion matrix.
The matrix has an entry for each gesture known to the
system. The newly trained gesture is compared with the
existing gestures in the library for confusability. If the newly
trained gesture is highly confusable with one or more
existing gestures, it should be retrained using more features
or different features. In a preferred embodiment the matrix
would be made up of rows and columns, in which the
columns represent the known gestures and the rows represent
[illegible] or data on each of the gestures. A cell in which
the data for a gesture, for example, jump, intersects with the
jump column, should contain the highest confusability indicator.
In another example, a cell in which a jump column
intersects with a row for arm flap data should contain a low
confusability factor or indicator. Once the confusion matrix
has been set for the newly entered gesture, the system
continues monitoring for additional movements by the subject
starting with block 502 of FIG. 5a.

Although the foregoing invention has been described in
some detail for purposes of clarity of understanding, it will
be apparent that certain changes and modifications may be
practiced within the scope of the appended claims. For
example, the image of the person performing the gesture
does not need to be composited onto a destination image and
then displayed on the computer monitor. The system can, for
example, simply recognize the gesture and perform a
particular function based on the meaning of the gesture. In
another example, the system can obtain data frames from
another medium, such as a video or film created at an earlier
time, instead of obtaining the data frames from a live figure
whose movements are captured by a camera in real-time. In
yet another example, the frame data set can contain
coordinates of [illegible] of a moving subject other than
coordinates specifically for a human body. Furthermore, it
should be noted that there are alternative ways of
implementing both the process and apparatus of the present
invention. Accordingly, the present embodiments are to be
considered as illustrative and not restrictive, and the
invention is not to be limited to the details given herein, but
may be modified within the scope and equivalents of the
appended claims.

What is claimed is:

1. A computer-implemented method of storing and recognizing
an aspect of a subject within an image, the method
including:

a) building a background model by obtaining at least one
frame of an image;

b) obtaining a data frame containing a subject;

c) removing background for said data frame based on said
background model whereby the subject is isolated;

d) analyzing the data frame thereby determining particular
coordinates of the subject at a particular time while the
subject is moving;

e) adding the particular coordinates to a frame data set;
and

f) examining the particular coordinates such that the
particular coordinates are compared to positional data
making up a plurality of recognizable aspects, wherein
a recognizable aspect is made up of at least one
dimension such that the positional data describes
dimensions of the recognized aspect.

2. A method as recited in claim 1 wherein the computer
includes a network connection capable of being coupled to
a network.

3. A method as recited in claim 1 wherein the data frame
is transmitted over the network.

4. A method as recited in claim 1 wherein the positional 
data is transmitted over the network. 

5. A method as recited in claim 1 and further comprising: 

e) repeating a through d for a plurality of data frames; and 

f) determining whether the plurality of the data frames 
when examined in a particular sequence, conveys a 
subject aspect by the subject that resembles a recog- 
nizable aspect, thereby causing an operation based on 
a predetermined meaning of the recognizable aspect be 
performed by a computer. 

6. A computer program embodied on a computer-readable 
medium that stores and recognizes an aspect of a subject 
within an image, including: 

a) a code segment that builds a background model by 
obtaining at least one frame of an image; 

b) a code segment that obtains a data frame containing a 
subject; 

c) a code segment that subtracts background for said data 
frame based on said background model; 

d) a code segment that analyzes the data frame thereby 
determining particular coordinates of the subject at a 
particular time while the subject is moving; 

e) a code segment that adds the particular coordinates to 
a frame data set; and 

f) a code segment that examines the particular coordinates 
such that the particular coordinates are compared to 
positional data making up a plurality of recognizable 
aspects, wherein a recognizable aspect is made up of at 
least one dimension such that the positional data 
describes dimensions of the recognized aspect. 

7. A computer program embodied on a computer-readable
medium as recited in claim 6 wherein the computer includes
a network connection capable of being coupled to a network.

8. A computer program embodied on a computer-readable
medium as recited in claim 6 wherein the data frame is
transmitted over the network.

9. A computer program embodied on a computer-readable 
medium as recited in claim 6 wherein the positional data is 
transmitted over the network. 

10. A computer program embodied on a computer- 
readable medium as recited in claim 6 and further compris- 
ing: 

e) a code segment that repeats a through d for a plurality 
of data frames; and 

f) a code segment that determines whether the plurality of 
the data frames when examined in a particular 
sequence, conveys a subject aspect by the subject that 
resembles a recognizable aspect, thereby causing an 
operation based on a predetermined meaning of the 
recognizable aspect be performed by a computer. 

11. A computer-implemented method of storing and rec- 
ognizing aspect of a subject within an image, the method 
including: 

building a background model by obtaining at least one 
frame of an image; 

storing a plurality of samples of a subject; 

subtracting background for said data frame based on said 
background model; 

inputting a number of key points that fit in an aspect of the 
subject; inputting a corresponding time value repre- 
senting the time of the aspect of the subject to com- 
plete; and 

wherein the key points and corresponding time value are
adapted for being used to recognize aspect of the
subject.

12. A method as recited in claim 11 further including:
determining locations of key points in a model representative
of the aspect of the subject.

13. A method as recited in claim 12 further including:
refining the model such that the plurality of samples of the
aspect of the subject fit within the model.

14. A method as recited in claim 11 further including:
calculating a probability distribution for key points indicating
the likelihood that a certain output will be
observed.

15. A method as recited in claim 14 further including:
calculating a confusion matrix wherein the aspect of the
subject is compared with previously stored recognizable
aspects so that similarities between the new aspect
to previously stored recognizable aspect can be determined.

16. A method as recited in claim 11 further including:
inputting a number of dimensions of the aspect of the
subject.

17. A computer program embodied on a computer- 
readable medium that stores and recognizes an aspect of a 
subject within an image, including: 

a code segment that builds a background model by 
obtaining at least one frame of an image; 

a code segment that stores a plurality of samples of a 
subject; 

a code segment that subtracts background for said data
frame based on said background model;

a code segment that inputs a number of key points that fit
in an aspect of the subject;

a code segment that inputs a corresponding time value
representing the time of the aspect of the subject to
complete; and

wherein the key points and corresponding time value are 
adapted for being used to recognize aspect of the 
subject. 

18. A computer program embodied on a computer- 
readable medium as recited in claim 17 further including: 

a code segment that determines locations of key points in 
a model representative of the aspect of the subject. 

19. A computer program embodied on a computer- 
readable medium as recited in claim 18 further including: 

a code segment that refines the model such that the 
plurality of samples of the aspect of the subject fit 
within the model. 

20. A computer program embodied on a computer- 
readable medium as recited in claim 17 further including: 

a code segment that calculates a probability distribution 
for key points indicating the likelihood that a certain 
output will be observed. 

21. A computer program embodied on a computer- 
readable medium as recited in claim 20 further including: 

a code segment that calculates a confusion matrix wherein 
the aspect of the subject is compared with previously 
stored recognizable aspects so that similarities between 
the new aspect to previously stored recognizable aspect 
can be determined. 

22. A computer program embodied on a computer- 
readable medium as recited in claim 17 further including: 

a code segment that inputs a number of dimensions of the
aspect of the subject.